Method, system, and computer program product for analyzing combinatorial libraries

ABSTRACT

The invention provides for in silico analysis of a virtual combinatorial library. Mapping coordinates for a training subset of products in the combinatorial library, and features of their building blocks, are obtained. A supervised machine learning approach is used to infer a mapping function ƒ that transforms the building block features for each product in the training subset of products to the corresponding mapping coordinates for each product in the training subset of products. The mapping function ƒ is then encoded in a computer readable medium. The mapping function ƒ can be retrieved and used to generate mapping coordinates for any product in the combinatorial library from the building block features associated with the product.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation-in-part of U.S. application Ser. No. 09/934,084, filed Aug. 22, 2001, which is incorporated by reference herein in its entirety, and it claims the benefit of U.S. Provisional Application No. 60/264,258, filed Jan. 29, 2001, and U.S. Provisional Application No. 60/274,238, filed Mar. 9, 2001, each of which is incorporated by reference herein in its entirety.

FIELD OF THE INVENTION

[0002] The present invention relates to combinatorial chemistry and computer aided molecular design. The present invention also relates to pattern analysis, information representation, information cartography and data mining. In particular, the present invention relates to generating mapping coordinates for products in a combinatorial chemical library based on reagent data.

BACKGROUND OF THE INVENTION

[0003] Molecular similarity is one of the most ubiquitous concepts in chemistry (Johnson, M. A., and Maggiora, G. M., Concepts and Applications of Molecular Similarity, Wiley, New York (1990)). It is used to analyze and categorize chemical phenomena, rationalize the behavior and function of molecules, and design new chemical entities with improved physical, chemical, and biological properties. Molecular similarity is typically quantified in the form of a numerical index derived either through direct observation, or through the measurement of a set of characteristic properties (descriptors), which are subsequently combined in some form of dissimilarity or distance measure. For large collections of compounds, similarities are usually described in the form of a symmetric matrix that contains all the pairwise relationships between the molecules in the collection. Unfortunately, pairwise similarity matrices do not lend themselves to numerical processing and visual inspection. A common solution to this problem is to embed the objects into a low-dimensional Euclidean space in a way that preserves the original pairwise proximities as faithfully as possible. This approach, known as multidimensional scaling (MDS) (Torgerson, W. S., Psychometrika 17:401-419 (1952); Kruskal, J. B., Psychometrika 29:115-129 (1964)) or nonlinear mapping (NLM) (Sammon, J. W., IEEE Trans. Comp. C18:401-409 (1969)), converts the data points into a set of real-valued vectors that can subsequently be used for a variety of pattern recognition and classification tasks.

[0004] Given a set of k objects, a symmetric matrix, r_(ij), of relationships between these objects, and a set of images on an m-dimensional map {y_(i), i=1, 2, . . . , k; y_(i) ∈ ℝ^(m)}, the problem is to place y_(i) onto the map in such a way that their Euclidean distances d_(ij)=∥y_(i)−y_(j)∥ approximate as closely as possible the corresponding values r_(ij). The quality of the projection is determined using a sum-of-squares error function known as stress, which measures the differences between d_(ij) and r_(ij) over all k(k−1)/2 possible pairs. This function is numerically minimized in order to generate the optimal map. This is typically carried out in an iterative fashion by: (1) generating an initial set of coordinates y_(i), (2) computing the distances d_(ij), (3) finding a new set of coordinates y_(i) that leads to a reduction in stress using a steepest descent algorithm, and (4) repeating steps (2) and (3) until the change in the stress function falls below some predefined threshold. There is a wide variety of MDS algorithms involving different error (stress) functions and optimization heuristics, which are reviewed in Schiffman, Reynolds and Young, Introduction to Multidimensional Scaling, Academic Press, New York (1981); Young and Hamer, Multidimensional Scaling: History, Theory and Applications, Erlbaum Associates, Inc., Hillsdale, N.J. (1987); Cox and Cox, Multidimensional Scaling, Number 59 in Monographs in Statistics and Applied Probability, Chapman-Hall (1994), and Borg, I., Groenen, P., Modern Multidimensional Scaling, Springer-Verlag, New York, (1997). The contents of these publications are incorporated herein by reference in their entireties.
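By way of non-limiting illustration, the four-step iterative procedure described above can be sketched in Python as follows. The sketch assumes an unweighted sum-of-squares stress, a fixed step size, and a random initial configuration; the function names, the step size, the tolerance, and the unit-square test data are hypothetical choices made for the example and are not taken from the cited references.

```python
import math
import random


def stress(y, r):
    """Sum-of-squares error between map distances d_ij and targets r_ij."""
    k = len(y)
    return sum(
        (math.dist(y[i], y[j]) - r[i][j]) ** 2
        for i in range(k)
        for j in range(i + 1, k)
    )


def mds_steepest_descent(r, m=2, step=0.02, tol=1e-10, max_iter=2000, seed=0):
    """Return (coordinates, stress history) for a k x k relationship matrix r."""
    rng = random.Random(seed)
    k = len(r)
    # Step (1): generate an initial set of coordinates y_i (random, by assumption).
    y = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(k)]
    history = [stress(y, r)]
    for _ in range(max_iter):
        # Step (2): compute distances d_ij; step (3): move against the gradient.
        grad = [[0.0] * m for _ in range(k)]
        for i in range(k):
            for j in range(k):
                if i == j:
                    continue
                d = math.dist(y[i], y[j])
                if d == 0.0:
                    continue  # gradient is undefined for coincident points
                coef = 2.0 * (d - r[i][j]) / d
                for a in range(m):
                    grad[i][a] += coef * (y[i][a] - y[j][a])
        y = [[y[i][a] - step * grad[i][a] for a in range(m)] for i in range(k)]
        history.append(stress(y, r))
        # Step (4): stop once the change in stress falls below the threshold.
        if abs(history[-2] - history[-1]) < tol:
            break
    return y, history


# Hypothetical target relationships: the corners of a unit square.
s = math.sqrt(2.0)
square = [
    [0.0, 1.0, s, 1.0],
    [1.0, 0.0, 1.0, s],
    [s, 1.0, 0.0, 1.0],
    [1.0, s, 1.0, 0.0],
]
coords, history = mds_steepest_descent(square)
print(f"stress: {history[0]:.4f} -> {history[-1]:.6f}")
```

Each pass of the loop visits every pair of objects, which is the source of the quadratic cost discussed in the following paragraph.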

[0005] Unfortunately, the quadratic nature of the stress function (i.e., the fact that the computational time required scales proportionally to k²) makes these algorithms impractical for data sets containing more than a few thousand items. Several attempts have been made to reduce the complexity of the task. (See Chang, C. L., and Lee, R. C. T., IEEE Trans. Syst., Man, Cybern., 1973, SMC-3, 197-200; Pykett, C. E., Electron. Lett., 1978, 14, 799-800; Lee, R. C. Y., Slagle, J. R., and Blum, H., IEEE Trans. Comput., 1977, C-27, 288-292; Biswas, G., Jain, A. K., and Dubes, R. C., IEEE Trans. Pattern Anal. Machine Intell., 1981, PAMI-3(6), 701-708). However, these methods either focus on a small subset of objects or a small fraction of distances, and the resulting maps are generally difficult to interpret.

[0006] Recently, two very effective alternative strategies were described. The first is based on a self-organizing procedure which repeatedly selects subsets of objects from the set of objects to be mapped, and refines their coordinates so that their distances on the map approximate more closely their corresponding relationships. (U.S. Pat. No. 6,295,514, and U.S. application Ser. No. 09/073,845, filed May 7, 1998, each of which is incorporated by reference herein in its entirety). The method involves the following steps: (1) placing the objects on the map at some initial coordinates, y_(i), (2) selecting a subset of objects, (3) revising the coordinates, y_(i), of at least some of the selected objects so that at least some of their distances, d_(ij), match more closely their corresponding relationships r_(ij), (4) repeating steps (2) and (3) for additional subsets of objects, and (5) exporting the refined coordinates, y_(i), for the entire set of objects or any subset thereof.

[0007] The second method attempts to derive an analytical mapping function that can generate mapping coordinates from a set of object features. (See U.S. application Ser. No. 09/303,671, filed May 3, 1999, and U.S. application Ser. No. 09/814,160, filed Mar. 22, 2001, each of which is incorporated by reference herein in its entirety). The method works as follows. Initially, a subset of objects from the set of objects to be mapped and their associated relationships are selected. This subset of objects is then mapped onto an m-dimensional map using the self-organizing procedure described above, or any other MDS algorithm. Hereafter, the coordinates of objects in this m-dimensional map shall be referred to as “output coordinates” or “output features”. In addition, a set of n attributes is determined for each of the selected subset of objects. Hereafter, these n attributes shall be referred to as “input coordinates” or “input features”. Thus, each object in the selected subset of objects is associated with an n-dimensional vector of input features and an m-dimensional vector of output features. A supervised machine learning approach is then employed to determine a functional relationship between the n-dimensional input and m-dimensional output vectors, and that functional relationship is recorded. Hereafter, this functional relationship shall be referred to as a “mapping function”. Additional objects that are not part of the selected subset of objects may be mapped by computing their input features and using them as input to the mapping function, which produces their output coordinates. The mapping function can be encoded in a neural network or a collection of neural networks.

[0008] Both the self-organizing and the neural network methods are general and can be used to produce maps of any desired dimensionality.

[0009] MDS can be particularly valuable for analyzing and visualizing combinatorial chemical libraries. A combinatorial library is a collection of chemical compounds derived from the systematic combination of a prescribed set of chemical building blocks according to a specific reaction protocol. A combinatorial library is typically represented as a list of variation sites on a molecular scaffold, each of which is associated with a list of chemical building blocks. Each compound (or product) in a combinatorial library can be represented by a unique tuple, {r₁, r₂, . . . , r_(d)}, where r_(i) is the building block at the i-th variation site, and d is the number of variation sites in the library. For example, a polypeptide combinatorial library is formed by combining a set of chemical building blocks called amino acids in every possible way for a given compound length (here, the number of variation sites is the number of amino acids along the polypeptide chain). Millions of products theoretically can be synthesized through such combinatorial mixing of building blocks. As one commentator has observed, the systematic combinatorial mixing of 100 interchangeable chemical building blocks results in the theoretical synthesis of 100 million tetrameric compounds or 10 billion pentameric compounds (Gallop et al., “Applications of Combinatorial Technologies to Drug Discovery, Background and Peptide Combinatorial Libraries,” J. Med. Chem. 37, 1233-1250 (1994), which is incorporated by reference herein in its entirety). A computer representation of a combinatorial library is often referred to as a virtual combinatorial library.
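The library sizes quoted above follow directly from the tuple representation, as the following Python sketch illustrates. The building block labels and the helper name library_size are hypothetical and are used only for the example.

```python
from itertools import islice, product


def library_size(blocks_per_site):
    """Number of products: the product of the building-block counts per site."""
    size = 1
    for blocks in blocks_per_site:
        size *= len(blocks)
    return size


# Hypothetical labels for 100 interchangeable building blocks.
building_blocks = [f"bb{i}" for i in range(100)]
tetramers = [building_blocks] * 4  # d = 4 variation sites
pentamers = [building_blocks] * 5  # d = 5 variation sites

print(library_size(tetramers))  # 100000000 (100 million)
print(library_size(pentamers))  # 10000000000 (10 billion)

# Products are tuples (r_1, ..., r_d); only the first few are generated here,
# because enumerating the whole library is what becomes expensive.
first_products = list(islice(product(*tetramers), 3))
print(first_products[0])  # ('bb0', 'bb0', 'bb0', 'bb0')
```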

[0010] MDS can simplify the analysis of combinatorial libraries in two important ways: (1) by reducing the number of dimensions that are required to describe the compounds in some abstract chemical property space in a way that preserves the original relationships among the compounds, and (2) by producing Cartesian coordinate vectors from data supplied directly or indirectly in the form of molecular similarities, so that they can be analyzed with conventional statistical and data mining techniques. Typical applications of coordinates obtained with MDS include visualization, diversity analysis, similarity searching, compound classification, structure-activity correlation, etc. (See, e.g., Agrafiotis, D. K., The diversity of chemical libraries, The Encyclopedia of Computational Chemistry, Schleyer, P. v. R., Allinger, N. L., Clark, T., Gasteiger, J., Kollman, P. A., Schaefer III, H. F., and Schreiner, P. R., Eds., John Wiley & Sons, Chichester, 742-761 (1998); and Agrafiotis, D. K., Myslik, J. C., and Salemme, F. R., Advances in diversity profiling and combinatorial series design, Mol. Diversity, 4(1), 1-22 (1999), each of which is incorporated by reference herein in its entirety).

[0011] Analyzing a combinatorial library based on the properties of the products (as opposed to the properties of their building blocks) is often referred to as product-based design. Several product-based methodologies for analyzing virtual combinatorial libraries have been developed. (See, e.g., Sheridan, R. P., and Kearsley, S. K., Using a genetic algorithm to suggest combinatorial libraries, J. Chem. Info. Comput. Sci., 35, 310-320 (1995); Weber, L., Wallbaum, S., Broger, C., and Gubernator, K., Optimization of the biological activity of combinatorial compound libraries by a genetic algorithm, Angew. Chem. Int. Ed. Eng, 34, 2280-2282 (1995); Singh, J., Ator, M. A., Jaeger, E. P., Allen, M. P., Whipple, D. A., Soloweij, J. E., Chowdhary, S., and Treasurywala, A. M., Application of genetic algorithms to combinatorial synthesis: a computational approach for lead identification and lead optimization, J. Am. Chem. Soc., 118, 1669-1676 (1996); Agrafiotis, D. K., Stochastic algorithms for maximizing molecular diversity, J. Chem. Info. Comput. Sci., 37, 841-851 (1997); Brown, R. D., and Martin, Y. C., Designing combinatorial library mixtures using genetic algorithms, J. Med. Chem., 40, 2304-2313 (1997); Murray, C. W., Clark, D. E., Auton, T. R., Firth, M. A., Li, J., Sykes, R. A., Waszkowycz, B., Westhead, D. R. and Young, S. C., PRO_SELECT: combining structure-based drug design and combinatorial chemistry for rapid lead discovery. 1. Technology, J. Comput.-Aided Mol. Des., 11, 193-207 (1997); Agrafiotis, D. K., and Lobanov, V. S., An efficient implementation of distance-based diversity metrics based on k-d trees, J. Chem. Inf. Comput. Sci., 39, 51-58 (1999); Gillet, V. J., Willett, P., Bradshaw, J., and Green, D. V. S., Selecting combinatorial libraries to optimize diversity and physical properties, J. Chem. Info. Comput. Sci., 39, 169-177 (1999); Stanton, R. V., Mount, J., and Miller, J. L., Combinatorial library design: maximizing model-fitting compounds with matrix synthesis constraints, J. Chem. Info. Comput. Sci., 40, 701-705 (2000); and Agrafiotis, D. K., and Lobanov, V. S., Ultrafast algorithm for designing focused combinatorial arrays, J. Chem. Info. Comput. Sci., 40, 1030-1038 (2000), each of which is incorporated by reference herein in its entirety).

[0012] However, as will be understood by a person skilled in the relevant art(s), this approach requires explicit enumeration (i.e., virtual synthesis) of the products in the virtual library. This process can be prohibitively expensive when the library contains a large number of products. That is, the analysis cannot be accomplished in a reasonable amount of time using available computing systems. In such cases, the most common solution is to restrict attention to a smaller subset of products from the virtual library, or to consider each variation site independently of all the others. (See, e.g., Martin, E. J., Blaney, J. M., Siani, M. A., Spellmeyer, D. C., Wong, A. K., and Moos, W. H., J. Med. Chem., 38, 1431-1436 (1995); Martin, E. J., Spellmeyer, D. C., Critchlow, R. E. Jr., and Blaney, J. M., Reviews in Computational Chemistry, Vol. 10, Lipkowitz, K. B., and Boyd, D. B., Eds., VCH, Weinheim (1997); and Martin, E., and Wong, A., Sensitivity analysis and other improvements to tailored combinatorial library design, J. Chem. Info. Comput. Sci., 40, 215-220 (2000), each of which is incorporated by reference herein in its entirety). Unfortunately, the latter approach, which is referred to as reagent-based design, often produces inferior results. (See, e.g., Gillet, V. J., Willett, P., and Bradshaw, J., J. Chem. Inf. Comput. Sci., 37(4), 731-740 (1997); and Jamois, E. A., Hassan, M., and Waldman, M., Evaluation of reagent-based and product-based strategies in the design of combinatorial library subsets, J. Chem. Inf. Comput. Sci., 40, 63-70 (2000), each of which is incorporated by reference herein in its entirety).

[0013] Hence there is a need for methods, systems, and computer program products that can be used to analyze large combinatorial chemical libraries, which do not have the limitations discussed above. In particular, there is a need for methods, systems, and computer program products for rapidly generating mapping coordinates for compounds in a combinatorial library that do not require the enumeration of every possible product in the library.

SUMMARY OF THE INVENTION

[0014] The present invention provides a method, system, and computer program product for generating mapping coordinates of combinatorial library products from features of library building blocks.

[0015] As described herein, at least one feature is determined for each building block of a combinatorial library having a plurality of products. A training subset of products is selected from the plurality of products in the combinatorial library, and at least one mapping coordinate is determined for each product in the training subset of products. A set of building blocks is identified for each product in the training subset of products, and features associated with these building blocks are combined to form an input features vector for each product in the training subset of products. A supervised machine learning approach is used to infer a mapping function ƒ that transforms the input features vector to the corresponding at least one mapping coordinate for each product in the training subset of products. The mapping function ƒ is encoded in a computer readable medium. After the mapping function ƒ is inferred, it is used for determining, estimating, or generating mapping coordinates of other products in the combinatorial library. Mapping coordinates of other products are determined, estimated, or generated from their corresponding input features vectors using the inferred mapping function ƒ. Sets of building blocks are identified for a plurality of additional products in the combinatorial library. Input features vectors are formed for the plurality of additional products. The input features vectors for the plurality of additional products are transformed using the mapping function ƒ to obtain at least one estimated mapping coordinate for each of the plurality of additional products.

[0016] In embodiments of the invention, laboratory-measured values and/or computed values are used as features for the building blocks of the combinatorial library. In embodiments of the invention, at least one of the features of the building blocks at a particular variation site in the combinatorial library is the same as at least one of the features of the building blocks at a different variation site in the library. In accordance with the invention, features of building blocks represent reagents used to construct the combinatorial library, fragments of reagents used to construct the combinatorial library, and/or modified fragments of reagents used to construct the combinatorial library. Other features that can be used in accordance with the invention will become apparent to individuals skilled in the relevant arts given the description of the invention herein.

[0017] In an embodiment, the mapping function ƒ is implemented using a neural network. The neural network is trained to implement the mapping function using the input features vector and the corresponding at least one mapping coordinate for each product of the training subset of products.

[0018] In other embodiments, the mapping function ƒ is a set of specialized mapping functions ƒ₁ through ƒ_(n). In an embodiment, each such specialized mapping function is implemented using a neural network.

[0019] In an embodiment, the mapping coordinates for the training subset of products are obtained by generating an initial set of mapping coordinates for the training subset of products and refining the coordinates in an iterative manner. In an embodiment, this is accomplished by selecting two products from the training subset of products and refining the mapping coordinates of at least one of the selected products based on the coordinates of the two products and a distance between the two products. The mapping coordinates of at least one of the selected products are refined so that the distance between the refined coordinates of the two products is more representative of a relationship between the products. This process is typically repeated for additional products of the training subset of products until a stop criterion is satisfied.

[0020] In another embodiment, the generation of mapping coordinates for the training subset of products is accomplished by selecting at least three products from the training subset of products and refining the mapping coordinates of at least some of the selected products based on the coordinates of at least some of the selected products and at least some of the distances between the selected products. The mapping coordinates of at least some of the selected products are refined so that at least some of the distances between the refined coordinates of at least some of the selected products are more representative of corresponding relationships between the products. This process is typically repeated for additional subsets of products from the training subset of products until a stop criterion is satisfied.

[0021] In other embodiments, the mapping coordinates for the training subset of products are generated using multidimensional scaling or nonlinear mapping so that the distances between the mapping coordinates of the products in the training subset of products are representative of corresponding relationships between the products.

[0022] In other embodiments, the mapping coordinates for the training subset of products are obtained from a different mapping function ƒ*. In one embodiment, the mapping function ƒ* takes as input a set of features associated with each product in the training subset of products and produces the corresponding at least one mapping coordinate for each product in the training subset of products. In another embodiment, the mapping function ƒ* takes as input a set of features associated with building blocks associated with each product in the training subset of products and produces the corresponding at least one mapping coordinate for each product in the training subset of products.

[0023] In other embodiments, the mapping coordinates for the training subset of products are obtained from a computer readable medium.

[0024] Further embodiments, features, and advantages of the present invention, as well as the structure and operation of the various embodiments of the present invention, are described in detail below with reference to the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES

[0025] The present invention is described with reference to the accompanying drawings wherein:

[0026] FIG. 1 illustrates an example combinatorial neural network according to an embodiment of the present invention;

[0027] FIGS. 2A-B illustrate a flowchart of a method for generating coordinates for products in a combinatorial library according to an embodiment of the present invention;

[0028] FIGS. 3A-B illustrate a flowchart of a second method for generating coordinates for products in a combinatorial library according to an embodiment of the present invention;

[0029] FIG. 4 illustrates a reaction scheme for a reductive amination combinatorial library;

[0030] FIG. 5A illustrates an example two-dimensional nonlinear map for a reductive amination library using Kier-Hall descriptors obtained by a non-linear mapping method;

[0031] FIG. 5B illustrates an example two-dimensional nonlinear map for a reductive amination library using Kier-Hall descriptors obtained by a combinatorial neural network according to the present invention;

[0032] FIG. 6A illustrates an example two-dimensional nonlinear map for a reductive amination library using Isis Keys descriptors obtained by a non-linear mapping method;

[0033] FIG. 6B illustrates an example two-dimensional nonlinear map for a reductive amination library using Isis Keys descriptors obtained by a combinatorial neural network according to the present invention; and

[0034] FIG. 7 illustrates an exemplary computing environment within which the invention can operate.

DETAILED DESCRIPTION OF THE INVENTION

[0035] Preferred embodiments of the present invention are now described with reference to the figures, where like reference numbers indicate identical or functionally similar elements. Also in the figures, the leftmost digit(s) of each reference number corresponds to the figure in which the reference number is first used. While specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. One skilled in the relevant art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the invention. It will also be apparent to one skilled in the relevant art(s) that this invention can also be employed in a variety of other devices and applications, and is not limited to just the embodiments described herein.

[0036] 1. Overview of the Invention

[0037] The present invention provides a method, system, and computer program product for generating mapping coordinates of combinatorial library products from features of library building blocks. In operation, features of library building blocks and mapping coordinates for a training subset of products in the combinatorial library are obtained and used to infer a mapping function ƒ that transforms building block features to mapping coordinates for each product in the training subset of products. The mapping function ƒ is encoded in a computer readable medium.

[0038] The mapping function ƒ can be retrieved and used to generate mapping coordinates for additional products in the combinatorial library from features of building blocks associated with the additional products. In an embodiment, after the mapping function ƒ is inferred, mapping coordinates of additional products in the combinatorial library are generated by obtaining features of the building blocks and using them as input to the mapping function ƒ, which generates mapping coordinates for the additional library products. The mapping coordinates can then be used for any subsequent analysis, searching, or classification. As will be understood by a person skilled in the relevant art, given the description herein, the present invention can be applied to a wide variety of mapping coordinates and/or building block features.

[0039] 2. Combinatorial Neural Networks

[0040] As described below, in some embodiments of the invention the mapping function ƒ is implemented using a neural network. This neural network is hereafter referred to as a combinatorial network or combinatorial neural network. The combinatorial network is trained to generate at least one mapping coordinate of the combinatorial library products from input features of their respective building blocks. As used herein, the term “mapping coordinates” refers to the mapping coordinates of the library products, and the term “building block features” refers to the input features of the library building blocks (e.g., reagents, fragments of reagents, and/or modified fragments of reagents).

[0041] Generally speaking, a combinatorial network comprises an input layer containing n₁+n₂+ . . . +n_(r) neurons, where r is the number of variation sites in the combinatorial library and n_(i) is the number of input features of the building blocks at the i-th variation site. In addition, a combinatorial network comprises one or more hidden layers containing one or more neurons each, and an output layer having a single neuron for each mapping coordinate generated by the neural network.

[0042] FIG. 1 illustrates an example combinatorial network 100 according to an embodiment of the invention. Combinatorial network 100 is a fully connected multilayer perceptron (MLP). In accordance with the invention, the outputs of combinatorial network 100 represent mapping coordinates of the library products.

[0043] As illustrated in FIG. 1, combinatorial network 100 has an input layer 102, a hidden layer 104, and an output layer 106. In an embodiment, a nonlinear transfer function, such as the logistic transfer function ƒ(x)=1/(1+e^(−x)), is used for the hidden and/or output layers. Combinatorial network 100 can be trained in accordance with the invention using, for example, the error back-propagation algorithm (see, e.g., S. Haykin, Neural Networks, Macmillan, New York (1994), which is incorporated by reference herein in its entirety). Other neural network architectures and/or training algorithms that can be used in accordance with the invention will become apparent to individuals skilled in the relevant arts given the description of the invention herein.
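For illustration only, a fully connected network with one hidden layer, logistic hidden units, and error back-propagation can be sketched in Python as follows. The class name TinyMLP, the linear output units, the learning rate, the layer sizes, and the synthetic training data are assumptions made for the example; they are not the configuration of combinatorial network 100.

```python
import math
import random


def logistic(x):
    """Logistic transfer function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden layer; logistic hidden units; linear outputs (an assumption)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One weight row per neuron; the last entry of each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        h = [logistic(sum(w * v for w, v in zip(row, x + [1.0])))
             for row in self.w1]
        y = [sum(w * v for w, v in zip(row, h + [1.0])) for row in self.w2]
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.1):
        """One back-propagation update for the error 0.5 * sum((y - t)^2)."""
        h, y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]
        # Propagate the output error back through the hidden layer.
        delta_h = [h[j] * (1.0 - h[j]) *
                   sum(err[o] * self.w2[o][j] for o in range(len(err)))
                   for j in range(len(h))]
        for o, row in enumerate(self.w2):
            for j, v in enumerate(h + [1.0]):
                row[j] -= lr * err[o] * v
        for j, row in enumerate(self.w1):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * delta_h[j] * v


def mean_error(net, data):
    return sum(0.5 * sum((yi - ti) ** 2 for yi, ti in zip(net.predict(x), t))
               for x, t in data) / len(data)


# Synthetic data: two input features mapped to two hypothetical coordinates.
data = [([a / 4, b / 4], [a / 4 - b / 4, (a / 4) * (b / 4)])
        for a in range(5) for b in range(5)]
net = TinyMLP(n_in=2, n_hidden=4, n_out=2)
before = mean_error(net, data)
for _ in range(300):
    for x, t in data:
        net.train_step(x, t)
after = mean_error(net, data)
print(f"mean error: {before:.4f} -> {after:.4f}")
```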

[0044] As will be understood by persons skilled in the relevant art given the description herein, training data used to train a combinatorial network typically include two sets of parameters. The first set consists of one or more input features for each of the library building blocks. The second set consists of one or more mapping coordinates for the training subset of products. The building block features are concatenated into a single array, and are presented to the network in the same order (e.g., ƒ₁₁, ƒ₁₂, . . . , ƒ_(1n1), ƒ₂₁, ƒ₂₂, . . . , ƒ_(2n2), . . . , ƒ_(r1), ƒ_(r2), . . . , ƒ_(rnr), where ƒ_(ij) is the j-th feature of the building block at the i-th variation site).
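The concatenation described above can be illustrated with the following Python sketch. The function name build_input_vector, the building block labels, and the feature values are hypothetical placeholders and do not represent measured or computed data.

```python
def build_input_vector(product, site_features):
    """Concatenate building-block features in variation-site order.

    product: tuple of building-block identifiers, one per variation site.
    site_features: one dict per site mapping identifier -> list of features.
    """
    if len(product) != len(site_features):
        raise ValueError("one building block is required per variation site")
    vector = []
    for site, block in enumerate(product):
        vector.extend(site_features[site][block])
    return vector


# Hypothetical two-site library; the numbers are placeholders only.
site_features = [
    {"amine_A": [0.1, 1.2, 3.0], "amine_B": [0.4, 0.9, 2.5]},  # n_1 = 3
    {"aldehyde_X": [5.0, 0.2], "aldehyde_Y": [4.1, 0.7]},      # n_2 = 2
]
x = build_input_vector(("amine_B", "aldehyde_X"), site_features)
print(x)  # [0.4, 0.9, 2.5, 5.0, 0.2]
```

The resulting vector has n₁+n₂ entries, matching the size of the input layer, and no product structure needs to be enumerated to form it.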

[0045] In an embodiment, the training subset of products presented to a network is determined by random sampling. (See Agrafiotis, D. K., and Lobanov, V. S., Nonlinear Mapping Networks. J. Chem. Info. Comput. Sci., 40, 1356-1362 (2000), which is incorporated by reference herein in its entirety).
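One possible way to draw such a random training subset without enumerating the library is sketched below in Python. Uniform sampling without replacement and the mixed-radix decoding of a product index are assumptions made for the example, as are the function name sample_products and the building block labels.

```python
import math
import random


def sample_products(blocks_per_site, n_samples, seed=0):
    """Draw distinct products uniformly at random without enumerating them."""
    rng = random.Random(seed)
    total = math.prod(len(blocks) for blocks in blocks_per_site)
    # range() is sampled lazily, so the full library is never materialized.
    picked = rng.sample(range(total), n_samples)
    products = []
    for index in picked:
        combo = []
        # Decode the index as a mixed-radix number, last site varying fastest.
        for blocks in reversed(blocks_per_site):
            index, pos = divmod(index, len(blocks))
            combo.append(blocks[pos])
        products.append(tuple(reversed(combo)))
    return products


# Hypothetical three-site library with 100 building blocks per site.
sites = [[f"R{s}_{i}" for i in range(100)] for s in (1, 2, 3)]
training = sample_products(sites, 5)
for p in training:
    print(p)
```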

[0046] 3. Method Embodiments of the Invention

[0047] As described herein, the invention permits the in silico characterization and analysis of large virtual combinatorial libraries. A virtual combinatorial library is an electronic representation of a collection of chemical compounds or “products” generated by the systematic combination of a number of chemical “building blocks” such as reagents according to a specific reaction protocol. Typically, embodiments of the invention are significantly faster than conventional library analysis methodologies that are based on full enumeration of the combinatorial products.

[0048] A. Example Method 200

[0049] FIGS. 2A and 2B illustrate a flowchart of the steps of a method 200 for generating mapping coordinates of products in a virtual combinatorial library based on features of corresponding building blocks. Typically, distances between the mapping coordinates of products represent relationships between the products.

[0050] The steps of method 200 will now be described with reference to FIGS. 2A and 2B. Method 200 begins with step 202.

[0051] In step 202, mapping coordinates are obtained for a training subset of products in the virtual combinatorial library.

[0052] In an embodiment of the invention, a training subset of products from the virtual combinatorial library is identified in step 202. Relationships between products in the training subset of products are then obtained and are used to produce mapping coordinates for the products in the training subset of products.

[0053] In an embodiment, distances between mapping coordinates of the products in the training subset of products are representative of corresponding relationships between the products.

[0054] In other embodiments, mapping coordinates for the products in the training subset of products are obtained in step 202 by generating an initial set of mapping coordinates for the products in the training subset of products and refining the coordinates in an iterative manner until a stop criterion is satisfied. This may be accomplished, for example, by selecting two products at a time from the training subset of products and refining the mapping coordinates of at least one of the selected products based on the coordinates of the two products and a distance between the two products. The mapping coordinates of at least one of the selected products are refined so that the distance between the refined coordinates of the two products is more representative of a relationship between the products. This mapping process is further described in U.S. Pat. No. 6,295,514, and U.S. application Ser. No. 09/073,845, filed May 7, 1998.
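A minimal Python sketch of such a pairwise refinement is given below for illustration. The particular update rule (moving both selected points along the line that joins them), the linearly decreasing learning rate, and the fixed number of steps used as the stop criterion are assumptions made for the example and are not asserted to be the procedure of the referenced patent and application.

```python
import math
import random


def total_error(y, r):
    """Sum of squared differences between map distances and relationships."""
    k = len(y)
    return sum((math.dist(y[i], y[j]) - r[i][j]) ** 2
               for i in range(k) for j in range(i + 1, k))


def pairwise_refine(y0, r, n_steps=5000, lr_start=1.0, lr_end=0.01, seed=0):
    """Refine a copy of the initial coordinates y0, one random pair at a time."""
    rng = random.Random(seed)
    y = [list(p) for p in y0]
    k = len(y)
    for step in range(n_steps):  # stop criterion: a fixed step budget
        lr = lr_start + (lr_end - lr_start) * step / max(n_steps - 1, 1)
        i, j = rng.sample(range(k), 2)  # select two products
        d = math.dist(y[i], y[j])
        if d == 0.0:
            continue  # direction is undefined for coincident points
        # Move both points along their connecting line so that the new
        # distance is d + lr * (r_ij - d), i.e. closer to the target r_ij.
        scale = 0.5 * lr * (r[i][j] - d) / d
        for a in range(len(y[i])):
            shift = scale * (y[i][a] - y[j][a])
            y[i][a] += shift
            y[j][a] -= shift
    return y


# Hypothetical relationships: a 3-4-5 triangle, which fits exactly in 2-D.
r = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
init_rng = random.Random(1)
y0 = [[init_rng.random(), init_rng.random()] for _ in range(3)]
y = pairwise_refine(y0, r)
print(f"error: {total_error(y0, r):.3f} -> {total_error(y, r):.6f}")
```

Because each step touches only one pair, the cost per step does not depend on the number of products in the training subset.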

[0055] In other embodiments, the generation of mapping coordinates for the products in the training subset of products is accomplished by selecting at least three products from the training subset of products and refining the mapping coordinates of at least some of the selected products based on the coordinates of at least some of the selected products and at least some of the distances between the selected products. The mapping coordinates of at least some of the selected products are refined so that at least some of the distances between the refined coordinates of at least some of the selected products are more representative of corresponding relationships between the products. This process is typically repeated for additional subsets of products from the training subset of products until a stop criterion is satisfied. This mapping process is further described in U.S. Pat. No. 6,295,514, and U.S. application Ser. No. 09/073,845, filed May 7, 1998.

[0056] In other embodiments, the mapping coordinates for the training subset of products are generated using multidimensional scaling or nonlinear mapping so that the distances between the mapping coordinates of the products in the training subset of products are representative of corresponding relationships between the products.

[0057] In other embodiments, the mapping coordinates for the training subset of products are obtained from a different mapping function ƒ*. In one embodiment, the mapping function ƒ* takes as input a set of features associated with each product in the training subset of products and produces the corresponding at least one mapping coordinate for each product in the training subset of products. In another embodiment, the mapping function ƒ* takes as input a set of features associated with building blocks associated with each product in the training subset of products and produces the corresponding at least one mapping coordinate for each product in the training subset of products.

[0058] In an embodiment, relationships between products in the training subset of products are obtained by obtaining a set of properties for each product in the training subset of products, and computing relationships between products using the properties of the training subset of products. As will be understood by persons skilled in the relevant art, any relationship measure that can relate products in the training subset of products can be used in this regard. In an embodiment, relationships between products represent similarities or dissimilarities between the products.
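As a minimal sketch of one such relationship measure, the following assumes that each product is described by a numeric property vector and that Euclidean distance is the chosen dissimilarity. The function name `dissimilarity_matrix` is hypothetical, and any other relationship measure could be substituted.

```python
import math


def dissimilarity_matrix(properties):
    """Return the symmetric matrix of pairwise Euclidean dissimilarities.

    properties[i] is the property vector of product i (illustrative input).
    """
    k = len(properties)
    d = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            # The relationship is symmetric, so fill both halves at once.
            d[i][j] = d[j][i] = math.dist(properties[i], properties[j])
    return d
```

The full matrix grows quadratically with the number of products, which is one reason the mapping coordinates are obtained only for a training subset.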

[0059] In other embodiments, mapping coordinates for the products in the training subset of products are obtained by obtaining a set of properties for each product in the training subset of products, and computing a set of latent coordinates from at least some of the properties of the training subset of products using a dimensionality reduction method.

[0060] In other embodiments, the mapping coordinates for the training subset of products are obtained in step 202, for example, by retrieving the mapping coordinates from a computer readable medium.

[0061] Other means that can be used in accordance with the invention to obtain mapping coordinates for the training subset of products will become apparent to individuals skilled in the relevant arts given the description of the invention herein.

[0062] In step 204, building block features (i.e., numerical representations of the building blocks of the combinatorial library) are obtained for the training subset of products. These building block features can be obtained in any desired manner. Furthermore, there is no requirement in step 204 to obtain the same type of numerical representations for the library building blocks as those obtained in step 202 for the training subset of products.

[0063] In embodiments of the invention, laboratory-measured values and/or computed values are used as features for the building blocks of the combinatorial library. In embodiments of the invention, at least one of the features of the building blocks at a particular variation site in the combinatorial library is the same as at least one of the features of the building blocks at a different variation site in the library.

[0064] In accordance with the invention, features of building blocks represent reagents used to construct the combinatorial library, fragments of reagents used to construct the combinatorial library, and/or modified fragments of reagents used to construct the combinatorial library. Other features that can be used in accordance with the invention will become apparent to individuals skilled in the relevant arts given the description of the invention herein.

[0065] In step 206, a supervised machine learning approach is used to infer a mapping function ƒ that transforms the building block features for each product in the training subset of products to the corresponding mapping coordinates for each product in the training subset of products. In embodiments of the invention, step 206 involves training a combinatorial neural network to transform the building block features for each product in the training subset of products to the corresponding mapping coordinates for each product in the training subset of products.

[0066] In step 208, the mapping function ƒ is encoded in a computer readable medium, whereby the mapping function ƒ is useful for generating mapping coordinates for additional products in the combinatorial library from building block features associated with the additional products. In one embodiment of the invention, the mapping function ƒ is implemented in step 208 using a neural network. In other embodiments, the mapping function ƒ is implemented in step 208 using a set of specialized mapping functions ƒ₁ through ƒ_(n). In some embodiments, each such specialized mapping function is implemented using a neural network. Other methods can also be used to implement the mapping function ƒ. In embodiments of the invention, step 208 is performed in conjunction with step 206.

[0067] In accordance with the invention, the encoded mapping function ƒ may be distributed and used by individuals to analyze virtual combinatorial libraries. In embodiments, the encoded mapping function ƒ is distributed as a part of a computer program product. These computer program products can be used to perform optional step 210.

[0068] In optional step 210, building block features for at least one additional product in the combinatorial library are provided to the mapping function ƒ, wherein the mapping function ƒ outputs mapping coordinates for the additional product. The mapping coordinates produced by the mapping function ƒ can be used, for example, to analyze, search, or classify additional products in the combinatorial library. When performed, optional step 210 can be performed by the same person or legal entity that performed steps 202-208, or by a different person or entity.

[0069] B. Example Method 300

[0070] FIGS. 3A and 3B illustrate a flowchart of the steps of a second method 300 for generating coordinates for products in a virtual combinatorial library according to an embodiment of the invention. As will become apparent from the description, method 300 includes a combinatorial network training phase (steps 302, 304, 306, 308, and 310) and an optional product mapping coordinate generation phase (steps 312 and 314). The steps of method 300 will now be described with reference to FIGS. 3A and 3B.

[0071] In step 302 of method 300, a subset of training products {p_(i), i=1, 2, . . . , k; p_(i) ∈ P} is selected from a combinatorial library P.

[0072] The training subset of products {p_(i), i=1, 2, . . . , k; p_(i) ∈ P} selected in step 302 can be chosen in any manner. For example, the training subset can be chosen randomly or non-randomly. In most cases, the composition of a particular training subset does not have a significant influence on the quality of a map as long as it is representative of the combinatorial library from which it is selected. Empirical evidence suggests that for moderately large combinatorial libraries (~10⁵ products), a training subset on the order of 1-3% of the library is usually sufficient to train a combinatorial network according to the invention.
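Random selection of the training subset can be sketched as follows. This is an illustration only: it assumes each product is identified by one building block index per variation site, and the function name `sample_training_products` and its parameters are hypothetical.

```python
import random


def sample_training_products(site_sizes, fraction, seed=0):
    """Draw a random training subset without enumerating product structures.

    site_sizes[i] is the number of building blocks at variation site i.
    Each returned product is a tuple of building block indices, one per site.
    """
    total = 1
    for n in site_sizes:
        total *= n
    k = max(1, round(fraction * total))
    rng = random.Random(seed)
    # Sample distinct flat indices, then decode each one as a mixed-radix number.
    products = []
    for flat in rng.sample(range(total), k):
        combo = []
        for n in reversed(site_sizes):
            flat, block = divmod(flat, n)
            combo.append(block)
        products.append(tuple(reversed(combo)))
    return products
```

For a two-site library with 300 building blocks per site, `sample_training_products([300, 300], 0.02)` returns 1,800 distinct index pairs.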

[0073] In step 304 of method 300, features of choice are computed for each reagent or building block in the combinatorial library P, {ƒ_(ijk), i=1, 2, . . . , r; j=1, 2, . . . , r_(i); k=1, 2, . . . , n_(i)}, where r is the number of variation sites in the combinatorial library, r_(i) is the number of building blocks at the i-th variation site, and n_(i) is the number of features used to characterize each building block at the i-th variation site. At least one feature is computed for each reagent or building block. Features computed for the reagents or building blocks at a particular variation site in the combinatorial library need not be the same as features computed for the building blocks at different variation sites in the combinatorial library. In embodiments of the invention, at least some of the reagent or building block features represent latent variables derived from other reagent or building block data, such as principal components, principal factors, MDS coordinates, etc.

[0074] In step 306, the training subset of products selected in step 302 is mapped onto ℝ^(m) using a nonlinear mapping algorithm (p_(i)→y_(i), i=1, 2, . . . , k, y_(i) ∈ ℝ^(m)) and a function of choice for assigning relationships between products.

[0075] This function takes as input a pair of products or data associated with a pair of products and returns a numerical value that represents a relationship (similarity, dissimilarity, or some other type of relationship) between the products.

[0076] In embodiments of the invention, the nonlinear mapping algorithm (p_(i)→y_(i), i=1, 2, . . . , k, y_(i) ∈ ℝ^(m)) is any conventional multidimensional scaling or nonlinear mapping algorithm. In other embodiments, the nonlinear mapping algorithm (p_(i)→y_(i), i=1, 2, . . . , k, y_(i) ∈ ℝ^(m)) comprises the following steps to determine each y_(i): (1) placing the training subset of products on an m-dimensional map at some initial coordinates; (2) selecting a pair of products from the training subset of products having a known or assigned relationship; (3) revising the mapping coordinates of one or both of the selected products based on their assigned relationship and the corresponding distance between the products on the map so that the distance between the products on the m-dimensional map is more representative of the assigned relationship between the products; and (4) repeating steps (2) and (3) for additional pairs of products from the training subset of products until a stop criterion is satisfied. This mapping process is further described in U.S. Pat. No. 6,295,514, and U.S. application Ser. No. 09/073,845, filed May 7, 1998.
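The pairwise procedure of steps (1) through (4) can be sketched as follows. This is a minimal illustration, not the method of U.S. Pat. No. 6,295,514: the update rule (moving both points along the line joining them in proportion to the mismatch), the linearly decreasing learning rate, and the fixed step budget used as the stop criterion are all assumptions, and the name `pairwise_refine` is hypothetical.

```python
import math
import random


def pairwise_refine(dissim, k, m=2, n_steps=20000,
                    lr_start=0.5, lr_end=0.01, seed=0):
    """Illustrative pairwise refinement of k products on an m-dimensional map.

    dissim(i, j) returns the assigned relationship (target distance)
    between products i and j.
    """
    rng = random.Random(seed)
    # Step (1): place the products at random initial coordinates.
    y = [[rng.random() for _ in range(m)] for _ in range(k)]
    for step in range(n_steps):
        lr = lr_start + (lr_end - lr_start) * step / max(1, n_steps - 1)
        # Step (2): select a pair of products.
        i, j = rng.sample(range(k), 2)
        target = dissim(i, j)
        d = math.dist(y[i], y[j])
        if d < 1e-12:
            continue  # coincident points give no direction to move along
        # Step (3): move both points so the map distance approaches the target.
        scale = 0.5 * lr * (target - d) / d
        for a in range(m):
            delta = scale * (y[i][a] - y[j][a])
            y[i][a] += delta
            y[j][a] -= delta
        # Step (4): the loop repeats until the step budget is spent.
    return y
```

With this rule, each update changes the distance of the selected pair from d to d + lr·(target − d), so a pair that is already at its target distance is left unchanged.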

[0077] In other embodiments of the invention, the nonlinear mapping algorithm (p_(i)→y_(i), i=1, 2, . . . , k, y_(i) ∈ ℝ^(m)) comprises the following steps to determine each y_(i): (1) placing the training subset of products on an m-dimensional map at some initial coordinates; (2) selecting at least three products having at least some known or assigned relationships; (3) revising mapping coordinates of at least some of the selected products so that at least some of the distances between the refined coordinates of at least some of the selected products are more representative of corresponding relationships between the products; and (4) repeating steps (2) and (3) for additional subsets of products from the training subset of products until a stop criterion is satisfied. This mapping process is further described in U.S. Pat. No. 6,295,514, and U.S. application Ser. No. 09/073,845, filed May 7, 1998.

[0078] In step 308 of method 300, for each product p_(i) of the training subset of products, the corresponding reagents or building blocks {t_(ij), j=1, 2, . . . , r} of product p_(i) are identified and their features $f_{t_{i1}1}, f_{t_{i1}2}, \ldots, f_{t_{i1}n_1}, \ldots, f_{t_{ir}n_r}$ are concatenated into a single vector, x_(i). The resulting training set is denoted T={(x_(i), y_(i)), i=1, 2, . . . , k}.
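The concatenation in step 308 can be illustrated with a short sketch. The data layout (a nested list indexed by variation site and building block) and the name `product_input_vector` are assumptions made for illustration.

```python
def product_input_vector(product, site_features):
    """Concatenate building block features into the input vector of a product.

    product is a tuple of building block indices, one per variation site.
    site_features[i][j] is the feature list of building block j at site i;
    the number of features may differ from site to site.
    """
    x = []
    for site, block in enumerate(product):
        x.extend(site_features[site][block])
    return x
```

The length of the resulting vector is n_(1) + n_(2) + . . . + n_(r), regardless of how many products the library contains.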

[0079] In step 310, a combinatorial network is trained to reproduce the mapping x_(i)→y_(i) using the input/output pairs of the training set T. In embodiments of the invention, the combinatorial network and its associated parameters can be exported for use by other systems and/or computer program products.

[0080] Step 310 ends when the combinatorial network is trained. Once the network is trained, the network can be used to generate mapping coordinates for products of combinatorial library P in accordance with steps 312 and 314 of method 300.

[0081] In step 312, for each product {p_(z), z=1, 2, . . . , w} of the combinatorial library P to be mapped onto ℝ^(m), corresponding reagents or building blocks {t_(zj), j=1, 2, . . . , r} are identified, and their features $f_{t_{z1}1}, f_{t_{z1}2}, \ldots, f_{t_{z1}n_1}, \ldots, f_{t_{zr}n_r}$ are concatenated into a single vector, x_(z). The features of step 312 are the features computed in step 304.

[0082] In step 314, the trained combinatorial network is used to map x_(z)→y_(z), wherein y_(z) represents mapping coordinates for product p_(z).
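Steps 312 and 314 can be sketched together as a loop over building block combinations. In this illustration `mapping_fn` is a hypothetical stand-in for the trained combinatorial network, and the nested-list layout of `site_features` is an assumption.

```python
import itertools


def map_library(site_features, mapping_fn):
    """Yield (product, coordinates) for every product of a virtual library.

    site_features[i][j] is the feature list of building block j at variation
    site i; mapping_fn takes a concatenated feature vector and returns
    mapping coordinates.
    """
    site_ranges = [range(len(blocks)) for blocks in site_features]
    for product in itertools.product(*site_ranges):
        # Step 312: concatenate the building block features into one vector.
        x = []
        for site, block in enumerate(product):
            x.extend(site_features[site][block])
        # Step 314: apply the mapping to obtain the coordinates.
        yield product, mapping_fn(x)
```

Only building block features are stored; no product structure or product descriptor is ever computed in this loop.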

[0083] In embodiments of the invention, the mapping coordinates produced by the combinatorial network are stored for subsequent retrieval and analysis. The mapping coordinates can be analyzed, for example, using conventional statistical and/or data mining techniques. The mapping coordinates of the products can also be used, for example, to generate a similarity plot of the products for viewing on a display screen. Other methods for analyzing the mapping coordinates of the products will be known to a person skilled in the relevant art given the description herein.

[0084] 4. Exemplary Applications of the Invention

[0085] In this section, two exemplary applications of the present invention are presented. Both of these applications illustrate the generation of 2-dimensional mapping coordinates for the products of a combinatorial library given a set of computed descriptors (properties) of the library products and a molecular similarity function evaluated on the basis of these descriptors. The objective was to map the products in the combinatorial library onto a 2-dimensional map in such a way that the Euclidean distances of the products on the 2-dimensional map approximated as closely as possible the corresponding dissimilarities of the respective products. Thus, the computed dissimilarities of the products were used as a measure of the relationships between the products. The two exemplary applications differ in the function that was used to measure the dissimilarity between two products in the virtual library.

[0086] FIG. 4 illustrates the reductive amination reaction scheme 400 that was used to generate the combinatorial library used in the exemplary applications. In accordance with reaction scheme 400, a virtual library of 90,000 products was generated using a set of 300 primary amines and 300 aldehydes (i.e., 600 reagents or building blocks). The reagents were selected from the Available Chemicals Directory (a database of commercially available reagents marketed by MDL Information Systems, Inc., 140 Catalina Street, San Leandro, Calif. 94577, which is incorporated by reference herein in its entirety).

[0087] Each of the 600 reagents and 90,000 products was described by two sets of descriptors: (1) Kier-Hall topological indices (KH), and (2) ISIS keys (IK). The former is a collection of 117 molecular connectivity indices, kappa shape indices, subgraph counts, information-theoretic indices, Bonchev-Trinajstić indices, and topological state indices. The latter are 166-dimensional binary vectors, where each bit encodes the presence or absence of a particular structural feature in the molecule. The bit assignment is based on the fragment dictionary used in the ISIS chemical database management system.

[0088] To eliminate redundancy in the data, the Kier-Hall (KH) descriptors for the reagents and products were independently normalized and decorrelated using principal component analysis (PCA). This process resulted in an orthogonal set of 24 and 23 latent variables for the reagents and products, respectively, which accounted for 99% of the total variance in the respective data. To simplify the input to the neural networks, PCA was also applied to the binary ISIS keys, resulting in 66 and 70 principal components for the reagents and products, respectively.
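The 99% variance cutoff can be illustrated with a short sketch. It assumes the eigenvalues of the descriptor covariance matrix (the variances along each principal component) have already been computed by a PCA routine; the function name `components_for_variance` is hypothetical.

```python
def components_for_variance(eigenvalues, threshold=0.99):
    """Return the smallest number of leading principal components whose
    combined variance reaches the given fraction of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= threshold * total:
            return count
    return len(ordered)
```

Applied to the actual eigenvalue spectra, a cutoff of this kind is what would yield the component counts reported above; the spectra themselves are not given in this description.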

[0089] In the case of the KH descriptors, the dissimilarity between two products was measured using the Euclidean distance in the 23-dimensional space formed by the products' principal components. For the ISIS keys, the dissimilarity between two products was measured using the Tanimoto distance:

S=1−T

[0090] where T is the Tanimoto coefficient: $T = \frac{\left| \mathrm{AND}(x,y) \right|}{\left| \mathrm{IOR}(x,y) \right|}$

[0091] and where x and y represent two binary encoded molecules, AND is the bitwise “and” operation (a bit in the result is set if both of the corresponding bits in the two operands are set), IOR is the bitwise “inclusive or” operation (a bit in the result is set if either or both of the corresponding bits in the two operands are set), and |·| denotes the number of bits set in the result.
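A minimal sketch of the Tanimoto distance follows, assuming each binary encoded molecule is held as a Python integer whose bits are the key bits. The function name and the handling of two all-zero fingerprints are assumptions; the text does not address that case.

```python
def tanimoto_distance(x, y):
    """Tanimoto distance S = 1 - T between two bit vectors stored as ints."""
    both = bin(x & y).count("1")    # bits set in AND(x, y)
    either = bin(x | y).count("1")  # bits set in IOR(x, y)
    if either == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return 1.0 - both / either
```

For example, the fingerprints 0b1100 and 0b1010 share one set bit out of three set in either, giving T = 1/3 and S = 2/3.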

[0092] In the exemplary applications described herein, the training set was determined by random sampling. Thus, the analysis consisted of the following steps. First, a set of descriptors was computed for each of the reagents that make up the virtual library. A random sample of 3,000 products (the training subset of products) from the virtual library was then identified, and mapped to two dimensions using the pairwise refinement method described above. This method starts by assigning initial mapping coordinates to the training subset of products, and then repeatedly selects two products from the training subset of products and refines their coordinates on the map so that the distance between the coordinates on the map corresponds more closely to the relationship (dissimilarity) between the selected products. This process terminates when a stop criterion is satisfied. The resulting coordinates were used as target outputs for a combinatorial network, which was trained to reproduce the mapping coordinates of the products in the training subset of products from the descriptors of their respective building blocks. Once trained, the neural network was used in a feed-forward manner to map the remaining products in the virtual library. For comparison, the map derived by applying the pairwise refinement process on the entire virtual library was also obtained. These reference maps are shown in FIGS. 5A and 6A for the KH and IK descriptors, respectively.

[0093] The results discussed herein were obtained using three-layer, fully connected neural networks according to the invention. The neural networks were trained using a standard error back-propagation algorithm (see, e.g., S. Haykin, Neural Networks, Macmillan, New York (1994)). The logistic transfer function ƒ(x)=1/(1+e^(−x)) was used for both hidden and output layers. Each network had 10 hidden units and was trained for 500 epochs, using a linearly decreasing learning rate from 1.0 to 0.01 and a momentum of 0.8. During each epoch, the training patterns were presented to the network in a randomized order.
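A training loop with these settings can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used to obtain the reported results: per-pattern (online) weight updates, a squared-error objective, uniform initial weights in [-0.5, 0.5], and targets pre-scaled into the open interval (0, 1) to match the logistic output units are all assumed. The names `train_network` and `predict` are hypothetical.

```python
import math
import random


def logistic(x):
    # Numerically stable form of 1 / (1 + e^(-x)).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict(W1, W2, x):
    """Feed-forward pass; each weight row ends with its bias weight."""
    xin = list(x) + [1.0]
    h = [logistic(sum(w * v for w, v in zip(row, xin))) for row in W1] + [1.0]
    return [logistic(sum(w * v for w, v in zip(row, h))) for row in W2]


def train_network(X, Y, n_hidden=10, epochs=500, lr_start=1.0, lr_end=0.01,
                  momentum=0.8, seed=0):
    """Online back-propagation for an input-hidden-output logistic network."""
    rng = random.Random(seed)
    n_in, n_out = len(X[0]), len(Y[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    V1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]   # momentum terms
    V2 = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
    order = list(range(len(X)))
    for epoch in range(epochs):
        # Learning rate decreases linearly from lr_start to lr_end.
        lr = lr_start + (lr_end - lr_start) * epoch / max(1, epochs - 1)
        rng.shuffle(order)  # randomized presentation order in each epoch
        for p in order:
            xin = list(X[p]) + [1.0]
            h = [logistic(sum(w * v for w, v in zip(row, xin))) for row in W1] + [1.0]
            out = [logistic(sum(w * v for w, v in zip(row, h))) for row in W2]
            # Error terms, computed before any weight is changed.
            d_out = [(Y[p][o] - out[o]) * out[o] * (1.0 - out[o]) for o in range(n_out)]
            d_hid = [h[j] * (1.0 - h[j]) * sum(d_out[o] * W2[o][j] for o in range(n_out))
                     for j in range(n_hidden)]
            for o in range(n_out):
                for j in range(n_hidden + 1):
                    V2[o][j] = momentum * V2[o][j] + lr * d_out[o] * h[j]
                    W2[o][j] += V2[o][j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    V1[j][i] = momentum * V1[j][i] + lr * d_hid[j] * xin[i]
                    W1[j][i] += V1[j][i]
    return W1, W2
```

Once trained, `predict(W1, W2, x)` is the feed-forward evaluation applied to the concatenated building block features of each remaining product; the outputs must be mapped back from (0, 1) to map coordinates by inverting whatever scaling was applied to the targets.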

[0094] For the KH maps, the input to the neural network consisted of the reagent principal components that accounted for 99% of the total variance in the reagent KH descriptors. For the IK maps, the input to the neural network consisted of the reagent principal components that accounted for 99% of the total variance in the reagent IK binary descriptors.

[0095] The maps obtained with the combinatorial networks trained using the aforementioned procedure are illustrated in FIGS. 5B and 6B for the KH and IK descriptors, respectively. As illustrated in FIGS. 5A-B and 6A-B, in both cases, combinatorial networks trained according to the invention produced maps that were comparable to those derived by fully enumerating the entire combinatorial library (FIGS. 5A and 6A, respectively). A more detailed study of the effects of network topology and training parameters, sample size, sample composition, structure representation, input and output dimensionality, and combinatorial complexity is described in Agrafiotis, D. K., and Lobanov, V. S., Multidimensional Scaling of Combinatorial Libraries without Explicit Enumeration, J. Comput. Chem., 22, 1712-1722 (2001), which is incorporated herein by reference in its entirety.

[0096] Although the preceding examples focus on 2-dimensional projections, the invention can also be used for mapping products into higher dimensions in order to facilitate their analysis by established statistical methods. Martin et al., for example, have used this technique to convert binary molecular fingerprints into Cartesian vectors so that they could be used for reagent selection using D-optimal experimental design (see, e.g., Martin, E., and Wong, A., Sensitivity analysis and other improvements to tailored combinatorial library design, J. Chem. Info. Comput. Sci., 40, 215-220 (2000), which is incorporated by reference herein in its entirety).

[0097] 5. System and Computer Program Product Embodiments

[0098] As will be understood by a person skilled in the relevant arts given the description herein, the method embodiments of the invention described above can be implemented as a system and/or a computer program product. FIG. 7 shows an example computer system 700 that supports implementation of the present invention. The present invention may be implemented using hardware, software, firmware, or a combination thereof. It may be implemented in a computer system or other processing system. The computer system 700 includes one or more processors, such as processor 704. The processor 704 is connected to a communication infrastructure 706 (e.g., a bus or network). Various software embodiments can be described in terms of this exemplary computer system. After reading this description, it will become apparent to a person skilled in the relevant art how to implement the invention using other computer systems and/or computer architectures.

[0099] Computer system 700 also includes a main memory 708, preferably random access memory (RAM), and may also include a secondary memory 710. The secondary memory 710 may include, for example, a hard disk drive 712 and/or a removable storage drive 714, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive 714 reads from and/or writes to a removable storage unit 718 in a well-known manner. Removable storage unit 718 represents a floppy disk, magnetic tape, optical disk, etc. As will be appreciated, the removable storage unit 718 includes a computer usable storage medium having stored therein computer software and/or data. In an embodiment of the invention, removable storage unit 718 can contain input data to be projected.

[0100] Secondary memory 710 can also include other similar means for allowing computer programs or input data to be loaded into computer system 700. Such means may include, for example, a removable storage unit 722 and an interface 720. Examples of such may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 722 and interfaces 720, which allow software and data to be transferred from the removable storage unit 722 to computer system 700.

[0101] Computer system 700 may also include a communications interface 724. Communications interface 724 allows software and data to be transferred between computer system 700 and external devices. Examples of communications interface 724 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 724 are in the form of signals 728 which may be electronic, electromagnetic, optical or other signals capable of being received by communications interface 724. These signals 728 are provided to communications interface 724 via a communications path (i.e., channel) 726. This channel 726 carries signals 728 and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and other communications channels. In an embodiment of the invention, signals 728 can include input data to be projected.

[0102] Computer programs (also called computer control logic) are stored in main memory 708 and/or secondary memory 710. Computer programs may also be received via communications interface 724. Such computer programs, when executed, enable the computer system 700 to perform the features of the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 704 to perform the features of the present invention. Accordingly, such computer programs represent controllers of the computer system 700.

[0103] 6. Conclusion

[0104] While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant art that various changes in detail can be made therein without departing from the spirit and scope of the invention. Thus the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:
 1. A method for generating coordinates for products in a combinatorial library based on features of corresponding building blocks, wherein distances between the coordinates represent relationships between the products, the method comprising the steps of: (1) obtaining mapping coordinates for a subset of products in the combinatorial library; (2) obtaining building block features for the subset of products in the combinatorial library; (3) using a supervised machine learning approach to infer a mapping function ƒ that transforms the building block features for each product in the subset of products to the corresponding mapping coordinates for each product in the subset of products; and (4) encoding the mapping function ƒ in a computer readable medium, whereby the mapping function ƒ is useful for generating coordinates for additional products in the combinatorial library from building block features associated with the additional products.
 2. The method according to claim 1, further comprising the step of: (5) providing building block features for at least one additional product to the mapping function ƒ, wherein the mapping function ƒ outputs generated mapping coordinates for the additional product.
 3. The method according to claim 1, wherein step (1) comprises generating the mapping coordinates for the subset of products.
 4. The method according to claim 3, wherein step (1) further comprises the steps of: (a) generating an initial set of mapping coordinates for the subset of products; (b) selecting two products from the subset of products; (c) refining the mapping coordinates of at least one product selected in step (1)(b) based on the coordinates of the two products and a distance between the two products so that the distance between the refined coordinates of the two products is more representative of the relationship between the products; and (d) repeating steps (1)(b) and (1)(c) for additional products until a stop criterion is obtained.
 5. The method according to claim 1, wherein step (1) comprises calculating the mapping coordinates for the subset of products using a dimensionality reduction algorithm.
 6. The method according to claim 1, wherein step (1) comprises retrieving the mapping coordinates for the subset of products from a computer readable medium.
 7. The method according to claim 1, wherein step (2) comprises the step of: using a laboratory measured value as a feature for each building block in at least one variation site in the combinatorial library.
 8. The method according to claim 1, wherein step (2) comprises the step of: using a computed value as a feature for each building block in at least one variation site in the combinatorial library.
 9. The method according to claim 1, wherein at least some of the building block features represent reagents used to construct the combinatorial library.
 10. The method according to claim 1, wherein at least some of the building block features represent fragments of reagents used to construct the combinatorial library.
 11. The method according to claim 1, wherein at least some of the building block features represent modified fragments of reagents used to construct the combinatorial library.
 12. The method according to claim 1, wherein the mapping function ƒ is encoded as a neural network.
 13. The method according to claim 1, wherein the mapping function ƒ is a set of specialized mapping functions ƒ₁ through ƒ_(n), each encoded as a neural network.
 14. A system for generating coordinates for products in a combinatorial library based on features of corresponding building blocks, wherein distances between the coordinates represent similarity/dissimilarity of the products, comprising: means for obtaining mapping coordinates for a subset of products in the combinatorial library; means for obtaining building block features for the subset of products in the combinatorial library; means for using a supervised machine learning approach to infer a mapping function ƒ that transforms the building block features for each product in the subset of products to the corresponding mapping coordinates for each product in the subset of products; and means for encoding the mapping function ƒ in a computer readable medium, whereby the mapping function ƒ is useful for generating coordinates for additional products in the combinatorial library from building block features associated with the additional products.
 15. The system of claim 14, further comprising: means for providing building block features for at least one additional product to the mapping function ƒ, wherein the mapping function ƒ outputs generated mapping coordinates for the additional product.
 16. The system of claim 14, wherein said means for obtaining mapping coordinates comprises: means for generating an initial set of mapping coordinates for the subset of products; means for selecting two products from the subset of products; means for refining the mapping coordinates of at least one product selected based on the coordinates of the two products and a distance between the two products so that the distance between the refined coordinates of the two products is more representative of the relationship between the products; and means for continuously selecting two products at a time and refining the mapping coordinates of at least one product selected until a stop criterion is obtained.
 17. The system of claim 14, wherein a laboratory measured value is used as a feature for each building block in at least one variation site in the combinatorial library.
 18. The system of claim 14, wherein a computed value is used as a feature for each building block in at least one variation site in the combinatorial library.
 19. The system of claim 14, wherein at least some of the building block features represent reagents used to construct the combinatorial library.
 20. The system of claim 14, wherein at least some of the building block features represent fragments of reagents used to construct the combinatorial library.
 21. The system of claim 14, wherein at least some of the building block features represent modified fragments of reagents used to construct the combinatorial library.
 22. The system of claim 14, wherein the mapping function ƒ is encoded as a neural network.
 23. The system of claim 14, wherein the mapping function ƒ is a set of specialized mapping functions ƒ₁ through ƒ_(n), each encoded as a neural network.
 24. A computerprogram product for generating coordinates for products in acombinatorial library based on features of corresponding buildingblocks, wherein distances between the coordinates representsimilarity/dissimilarity of the products, said computer program productcomprising a computer useable medium having computer program logicrecorded thereon for controlling a processor, said computer programlogic comprising: a procedure that enables said processor to obtainmapping coordinates for a subset of products in the combinatoriallibrary; a procedure that enables said processor to obtain buildingblock features for the subset of products in the combinatorial library;a procedure that enables said processor to use a supervised machinelearning approach to infer a mapping function ƒ that transforms thebuilding block features for each product in the subset of products tothe corresponding mapping coordinates for each product in the subset ofproducts; and a procedure that enables said processor to encode themapping function ƒ in a computer readable medium, whereby the mappingfunction ƒ is useful for generating coordinates for additional productsin the combinatorial library from building block features associatedwith the additional products.
 25. The computer program product of claim 24, further comprising: a procedure that enables said processor to provide building block features for at least one additional product to the mapping function ƒ, wherein the mapping function ƒ outputs generated mapping coordinates for the additional product.
 26. The computer program product of claim 24, wherein said procedure that enables said processor to obtain mapping coordinates comprises: a procedure that enables said processor to generate an initial set of mapping coordinates for the subset of products; a procedure that enables said processor to select two products from the subset of products; a procedure that enables said processor to refine the mapping coordinates of at least one product selected based on the coordinates of the two products and a distance between the two products so that the distance between the refined coordinates of the two products is more representative of the relationship between the products; and a procedure that enables said processor to continue selecting two products at a time and refining the mapping coordinates of at least one product selected until a stop criterion is obtained.
 27. The computer program product of claim 24, wherein a laboratory measured value is used as a feature for each building block in at least one variation site in the combinatorial library.
 28. The computer program product of claim 24, wherein a computed value is used as a feature for each building block in at least one variation site in the combinatorial library.
 29. The computer program product of claim 24, wherein at least some of the building block features represent reagents used to construct the combinatorial library.
 30. The computer program product of claim 24, wherein at least some of the building block features represent fragments of reagents used to construct the combinatorial library.
 31. The computer program product of claim 24, wherein at least some of the building block features represent modified fragments of reagents used to construct the combinatorial library.
 32. The computer program product of claim 24, wherein the mapping function ƒ is encoded as a neural network.
 33. The computer program product of claim 24, wherein the mapping function ƒ is a set of specialized mapping functions ƒ₁ through ƒ_(n), each encoded as a neural network.
 34. A method for analyzing a combinatorial library {ƒ_(ijk), i=1, 2, . . . , r; j=1, 2, . . . , r_(i); k=1, 2, . . . , n}, wherein r represents the number of variation sites in the library, r_(i) represents the number of building blocks at the i-th variation site, and n represents the number of descriptors used to characterize each reagent, the method comprising the steps of: (1) computing at least one descriptor for each reagent of the combinatorial library; (2) selecting a training subset of products {p_(i), i=1, 2, . . . , k} of the combinatorial library; (3) mapping the training subset of products onto ℝ^(m) using a nonlinear mapping algorithm (p_(i)→y_(i), i=1, 2, . . . , k, y_(i) ∈ ℝ^(m)); (4) identifying, for each product p_(i) of the training subset of products, corresponding reagents {t_(ij), j=1, 2, . . . , r} and concatenating their descriptors ƒ_(1t_(i1)), ƒ_(2t_(i2)), . . . , ƒ_(rt_(ir)) into a single vector, x_(i); (5) training a combinatorial network to recognize the mapping x_(i)→y_(i) using input/output pairs of training set T={(x_(i), y_(i)), i=1, 2, . . . , k}; (6) identifying, after the combinatorial network is trained, for each product {p_(z), z=1, 2, . . . , w} of the combinatorial library to be mapped onto ℝ^(m), corresponding reagents {t_(j), j=1, 2, . . . , r} and concatenating their descriptors, ƒ_(1t_(1)), ƒ_(2t_(2)), . . . , ƒ_(rt_(r)), into a single vector, x_(z); and (7) mapping x_(z)→y_(z) using the trained combinatorial network, wherein y_(z) represents generated coordinates for product p_(z).
 35. The method of claim 34, wherein step (2) comprises: selecting the training subset of products randomly.
 36. The method of claim 1, wherein step (3) comprises: (a) placing the selected training subset of products on an m-dimensional nonlinear map using randomly assigned coordinates; (b) selecting a pair of the products having a similarity relationship; (c) revising the coordinates of at least one of the selected pair of products based on the similarity relationship and the corresponding distance between the products on the nonlinear map; and (d) repeating steps (b) and (c) for additional pairs of the products until the distances between the products on the m-dimensional nonlinear map are representative of the similarity relationships between the products.
 37. The method of claim 34, further comprising the step of: storing an output of the trained combinatorial network on a computer readable storage device.
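
By way of illustration only, the descriptor concatenation recited in steps (4) and (6) of claim 34 might be sketched as follows. This is a minimal sketch and not the claimed implementation: the reagent names, the descriptor values, and the helper names `SITE_DESCRIPTORS`, `product_input_vector`, and `enumerate_products` are all hypothetical, and a two-site library with two descriptors per reagent is assumed. The sketch shows that the input vector x for a product can be assembled from reagent descriptors alone, without building the product structure.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Hypothetical descriptor tables, one dict per variation site. Each maps a
# reagent name to its descriptor vector. All names and values are made up.
SITE_DESCRIPTORS: List[Dict[str, List[float]]] = [
    {"amine_A": [0.1, 1.2], "amine_B": [0.4, 0.9]},        # variation site 1
    {"aldehyde_X": [2.0, 0.3], "aldehyde_Y": [1.5, 0.7]},  # variation site 2
]


def product_input_vector(
    reagents: Sequence[str],
    site_descriptors: List[Dict[str, List[float]]] = SITE_DESCRIPTORS,
) -> List[float]:
    """Concatenate the descriptors of one product's reagents into a vector x."""
    if len(reagents) != len(site_descriptors):
        raise ValueError("one reagent is required per variation site")
    x: List[float] = []
    for site, reagent in zip(site_descriptors, reagents):
        x.extend(site[reagent])  # append this site's reagent descriptors
    return x


def enumerate_products(
    site_descriptors: List[Dict[str, List[float]]] = SITE_DESCRIPTORS,
) -> List[Tuple[str, ...]]:
    """List every reagent combination (one reagent per site) in the library."""
    return list(product(*(sorted(site) for site in site_descriptors)))


if __name__ == "__main__":
    for reagents in enumerate_products():
        print(reagents, product_input_vector(reagents))
```

In such a sketch, the vectors produced for a training subset would be paired with mapping coordinates y to form the training set T of step (5), and the same function would supply the inputs x_(z) for the remaining products in step (7).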